Skip to content

migration for inactive user purge #6676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

smithellis
Copy link
Contributor

1 - get all users who we consider inactive (3 years since login)
2 - divide into users having content and users without content
3 - hard deletes non-content users using the _base_manager
4 - pushes other users through the deletion pipeline
5 - implements batching to manage resources

Comment on lines 17 to 18
utils_module = importlib.import_module('kitsune.users.utils')
delete_user_pipeline = utils_module.delete_user_pipeline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the need for this. I think it's safe to do from kitsune.users.utils import delete_user_pipeline outside of this function.

Comment on lines 60 to 70
last_id = 0
while True:
batch = list(
query.filter(id__gt=last_id)
.order_by('id')
.annotate(has_content=has_content_criteria)
[:batch_size]
)
if not batch:
break
last_id = max(user.id for user in batch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could replace all of this and gain efficiency (fewer queries) by using the iterator method that Django provides (as we do in 0034_batch_delete_non_migrated_users.py). So something like:

current_batch = []
for user in query.annotate(has_content=has_content_criteria).iterator(chunk_size=batch_size):
        current_batch.append(user)
        if len(current_batch) >= batch_size:
            # Do the work
            ...
            current_batch = []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked myself out of iterator when I was sorting into two groups; but when I looked at your note below, I am back to being in camp iterator().

Comment on lines 72 to 73
users_with_content = [user for user in batch if user.has_content]
users_no_content = [user for user in batch if not user.has_content]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you're iterating over 1k users twice, and then later iterating over the users_no_content to get their id's. You could do all of that in one iteration, something like:

for user in current_batch:
    if user.has_content:
        # run the pipeline
    else:
        users_no_content.append(user.id)
else:
    User._base_manager.filter(id__in=users_no_content).delete()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great catch. Thanks!

Copy link
Contributor

@escattone escattone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc

# Progress reporting every 1000 users
if processed_count % 1000 == 0:
elapsed_time = time.time() - start_time
progress_pct = (processed_count/total_users*100) if total_users > 0 else 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Since you've already checked if total_users is zero above (and returned in that case), you don't need to check again here, so I think you can remove if total_users > 0 else 0

Comment on lines 92 to 94
avg_time = elapsed_time / processed_count if processed_count > 0 else 0
remaining_time = (total_users - processed_count) * avg_time if processed_count > 0 else 0
current_rate = processed_count / elapsed_time * 60 if elapsed_time > 0 else 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Same here. I think for all of these lines, the processed_count and elapsed_time values are guaranteed to be greater than zero, so I don't think you need the if x > 0 else 0 checks.

@escattone
Copy link
Contributor

@smithellis Forgot to mention one more thing. Just FYI, you can use migrations.RunPython.noop instead of defining a reverse function that does nothing. It's equivalent.


cutoff_date = timezone.now() - timedelta(days=3*365)

query = User.objects.filter(last_login__lt=cutoff_date).annotate(has_content=has_content_criteria)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using annotate here instead of filter/exclude/Exists?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot pass a Q object directly to annotate(). Not at least in version 4.2.+ that we are using. Annotation in 5.2 vs 4.2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use case absolutely works - I think it's just not specifically called out in the 4.2 docs but is in later docs. I can run this query and see the output and it builds valid sql which executes properly.

I'm annotating here so we can later divide our users into those with and those without content, so we can execute a quicker delete process on non-content users.

@@ -0,0 +1,131 @@
from datetime import timedelta
import time
import importlib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, leftover from prior change. Weird pre-commit didn't fuss at me.

"""
Delete users who haven't logged in for over three years.
"""
User = get_user_model()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be User = apps.get_model("auth", "User") in migrations similar to the previous one (0034)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because the delete_user_pipeline function needs an actual User instance vs. a historical model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants